Overview

Alcohol consumption causes health loss and economic cost in the United States. My project analyze the impact of geography and median household income on drinking prevalence. While it is impossible to fully stop populational alcohol use, analyzing factors which can potentially affect alcohol consumption helps develop preventive strategies. The goal of my project is to explore the potential relationship between alcohol use prevalence vs. geography and median household income. I use the dataset from GHDx named United States Alcohol Use Prevalence by County 2002-2012 to get information of the alcohol pattern across the U.S. (by year and sex). Additionally, I use a map shapefile and median household income data from the American Community Survey (ACS). In addition to the shapefile, I also get the coordinates for the center of each states from Google Developer.

My Github Repository

Faculty/Staff: Dr. Bomyi Lim, Sherrie Xie (Graduate student), Dr. Blanca Himes

Dr. Bomyi Lim helped shape my idea. Sherrie Xie (Ph.D. student) helped with maps and the correlation test Dr. Blanca Himes helped with maps and notified some limitations of my study.

Introduction

There are globally 39% males and 25% females who are current drinkers. These percentages correspond to 1.5 billion males and 0.9 billion females. The alcohol consumption has a complex association with health outcomes. It has been linked to 60 acute and chronic diseases, such as liver cancers, alcohol use disorders, and hypertension. Although several previous research suggests that moderate level of alcohol consumption brings benefit to humans, this finding has resently been questioned. As a matter of fact, a new study published in 2018 (GBD 2016 Alcohol Collaborators, 2018) states that the level of alcohol consumption which minimize harm on human health is zero. Negative effect of alcohol on health outcomes raise with the increase in the level of consumption. Additionally, alcohol consumption is the 7th leading risk factor for both death and disability all over the world. It led to 2.8 million deaths in 2016. In addition to health loss, the alcohol use also has an impact on the economy. The cost of excessive alcohol use in the United States reached 249 billion dollars in 2010, which can be counted as 2.05 dollars per drink on average, or $807 per person annually. 77% of these costs were due to binge drinking. In addition to that, 2 of every 5 dollars were paid by federal, state, and local governments toward losses in workplace productivity, health care, criminal justice, and motor vehicle crashes.

While many research focus on the cause of alcohol consumption, few of them address the factors that affect alcohol consumption behavior. Therefore, while it is impossible to fully stop populational alcohol use, analyzing factors which can potentially affect alcohol consumption helps develop preventive strategies such as changing public health policies, increase resources for addiction services, and implementing pricing strategies to increase the price of alcohol to limit alcohol consumption. In this project, I specifically focus on whether geographic location and median household income have impact on the drinking prevalence.

Methods

I use the dataset from GHDx named United States Alcohol Use Prevalence by County 2002-2012 to get information of the alcohol pattern across the United States. The dataset provides alcohol use prevalence estimates by year and sex form 2002 to 2012. It also divides alcohol consumption degree into three levels: any drinking means at least one drink of any alcoholic beverage in the past 30 days; heavy drinking is more than one drink per day for women or two drinks per day for men; and binge drinking stands for more than four drinks for women or five drinks for men on a single occasion at least once. I use map shapefiles and median household income census from the American Community Survey (ACS). Additionally, I get the coordinates for the center of each states from Google Developer for the purpose of drawing dots on maps. My project uses t-tests to verify the difference in drinking prevalence change between males and females. Static maps are utilized to visualize geographic information as well as the combination of geographic information and median household income. In addition, Spearman’s rank correlation tests are used to verify the relationship between median household income and drinking prevalence.

## 
## The downloaded binary packages are in
##  /var/folders/6c/wrs0172s3dzg403t4945383w0000gn/T//RtmplNlHBe/downloaded_packages
## 
## The downloaded binary packages are in
##  /var/folders/6c/wrs0172s3dzg403t4945383w0000gn/T//RtmplNlHBe/downloaded_packages
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  2.1.3     ✔ purrr   0.3.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths

Data is imported and cleaned up preliminarily.

First, I compare national change of alcohol use (by sex)

## Warning: Ignoring unknown parameters: stat

## Warning: Ignoring unknown parameters: stat

The data for heavy drinking from 2002 to 2004 is not available. Therefore, I perform all my later study from 2005 to 2012. The plot shows that the prevalence of any drinking did not change over the 8 years. There is a decrease in males but increase in females in any drinking. In addition to that, the prevalence of both heavy and binge drinking increased. The change in binge drinking is larger than the change in heavy drinking, while the increase in female drinking prevalence in both heavy and binge drinking are higher than the male drinking prevalence. Then, I used box plots and t-tests to verify this difference between male and female.

## No id variables; using all as measure variables
## No id variables; using all as measure variables

## 
##  Welch Two Sample t-test
## 
## data:  box_heavy_state_change$Prevalence by box_heavy_state_change$Sexes
## t = -2.0565, df = 81.787, p-value = 0.04292
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.74837005 -0.01241427
## sample estimates:
##   mean in group Males mean in group Females 
##              1.394118              1.774510
## 
##  Welch Two Sample t-test
## 
## data:  box_binge_state_change$Prevalence by box_binge_state_change$Sexes
## t = -2.0784, df = 98.289, p-value = 0.04028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.00422600 -0.02322498
## sample estimates:
##   mean in group Males mean in group Females 
##              1.525490              2.039216

Female drinking prevalence increased sinificently higher than male drinking prevalence in both heavy and binge drinking prevalence (heavy: p-value = 0.04292, binge: p-value = 0.04028). Because of this difference, I analyze whether the drinking prevalence differ by sex in the context of geographic locations.

Results

To analyze whether geographic location (different states) affect alcohol consumption level, I make 2X3 (heavy/binge vs. both/females/males) maps to visualize geographic impact.

## Reading layer `acs_2012_2016_county_us_B27001' from data source `/Users/slin/BMIN503_Final_Project/acs_2012_2016_county_us_B27001/acs_2012_2016_county_us_B27001.shp' using driver `ESRI Shapefile'
## Simple feature collection with 3141 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -2031905 ymin: -2427680 xmax: 2516374 ymax: 732103.3
## epsg (SRID):    NA
## proj4string:    +proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs
#See the changes at state level
#import state code, clean up
State_code <- read_excel("State Code.xlsx")
Code.update <- State_code$Code
Code.new <- str_pad(Code.update, 2, pad = "0")
State_code <- State_code%>%
  mutate(Code = as.character(Code.new))

#join state code to alcohol_heavy_state_change
alcohol_heavy_state_change <- inner_join(alcohol_heavy_state_change, State_code, by = "State")

#join state code to alcohol_binge_state_change
alcohol_binge_state_change <- inner_join(alcohol_binge_state_change, State_code, by = "State")

#aggregate map information by state, join to cleaned heavy/binge data
US_states <- US_sf%>%
  dplyr::select(geometry)%>%
  aggregate(by = list(US_sf$Code), FUN = mean)%>%
  rename(Code = "Group.1")%>%
  mutate(Code = as.character(Code))
US_states_heavy <- inner_join(US_states, alcohol_heavy_state_change, by = "Code")
US_states_binge <- inner_join(US_states, alcohol_binge_state_change, by = "Code")

#plot
#set min&max for heavy
prev_h_both_min <- sort(US_states_heavy$Both)[1]
prev_h_both_max <- sort(US_states_heavy$Both, decreasing = TRUE)[1]
prev_h_male_min <- sort(US_states_heavy$Males)[1]
prev_h_male_max <- sort(US_states_heavy$Males, decreasing = TRUE)[1]
prev_h_female_min <- sort(US_states_heavy$Females)[1]
prev_h_female_max <- sort(US_states_heavy$Females, decreasing = TRUE)[1]

#set min&max for binge
prev_b_both_min <- sort(US_states_binge$Both)[1]
prev_b_both_max <- sort(US_states_binge$Both, decreasing = TRUE)[1]
prev_b_male_min <- sort(US_states_binge$Males)[1]
prev_b_male_max <- sort(US_states_binge$Males, decreasing = TRUE)[1]
prev_b_female_min <- sort(US_states_binge$Females)[1]
prev_b_female_max <- sort(US_states_binge$Females, decreasing = TRUE)[1]

#set theme
my_theme <- function() {
  theme_minimal() +                                  
  theme(axis.line = element_blank(),                 
        axis.text = element_blank(),                 
        axis.title = element_blank(),
        panel.grid = element_line(color = "white"),  
        legend.key.size = unit(0.6, "cm"),          
        legend.text = element_text(size = 12),       
        legend.title = element_text(size = 12),
        plot.title = element_text(size = 16))      
}

#set color for both, males, and females
myPalette_both <- colorRampPalette(brewer.pal(9, "Greens"))
myPalette_male <- colorRampPalette(brewer.pal(9, "Blues"))
myPalette_female <- colorRampPalette(brewer.pal(9, "Reds"))

#map plot:heavy, both
ggplot() + 
  geom_sf(data = US_states_heavy, aes(fill = Both), lwd = 0) +
  my_theme() +
  ggtitle("Change in Age-standardized Prevalence of Heavy Drinking\n 2005 to 2012, Both Sexes") +
  scale_fill_gradientn(name = "Prevalence\nChange (%)", colours = myPalette_both(100),
                       limit = range(prev_h_both_min, prev_h_both_max)) +
  theme(plot.title = element_text(hjust = 0.5))

When sex is not taken into account, the alcohol use prevalence increased the most in the West North Central states, North East (binge drinking), Montana and Kentucky. And there is a drop in prevalence in Nevada in the binge drinking condition.

The increase in male drinking prevalence concentrates in the West North Central states, Kentucky, and Louisiana. Females contributes a lot in Montana for both heavy and binge drinking and in the Great Lakes region for the binge drinking. Additionally, both males and females drinking prevalence largely increased in the North East in the context of binge drinking.

I also check how the drinking prevalence changed over the ten years given that there is a large gap between 2002/2005 vs. 2012 alcohol consomption in heavy & binge drinking conditions.

#get national alcohol use values (2002-2012)
changes_national <- combine_alcohol%>%
  filter(State == "National")
changes_national <- changes_national[3:35]

#clean up
any_changes <- as.data.frame(matrix(as.numeric(changes_national[1,]), nrow = 3, ncol = 11))
heavy_changes <- as.data.frame(matrix(as.numeric(changes_national[2,]), nrow = 3, ncol = 11))
binge_changes <- as.data.frame(matrix(as.numeric(changes_national[3,]), nrow = 3, ncol = 11))
combine_changes_national <- rbind(any_changes, heavy_changes, binge_changes)
combine_changes_national$Degree <-c(rep("Any", 3), rep("Heavy", 3), rep("Binge",3))
combine_changes_national <- combine_changes_national%>%
  rename("2002" = "V1", "2003" = "V2", "2004" = "V3", "2005" = "V4", "2006" = "V5", "2007" = "V6", "2008" = "V7", "2009" = "V8", "2010" = "V9", "2011" = "V10", "2012" = "V11")

#Plot change in heavy alcohol use from 2002 to 2012
#clean up
combine_changes_national <- melt(combine_changes_national, id.vars = "Degree")%>%
  rename("Year" = "variable")
combine_changes_national$Sexes <- rep(c("Both", "Females", "Males"), 33)
combine_changes_national_heavy <- combine_changes_national%>%
  filter(Degree == "Heavy")
  #can use drop_na() to drop N/A data from 2002-2005

#plot: heavy prevalence, 2002-2012
ggplot(combine_changes_national_heavy, aes(x = Year, y = value, colour = Sexes)) + 
  geom_line(aes(x = Year, y = value, group = Sexes)) +
  scale_color_manual("Sexes", values = c("Both" = "green3", "Females" = "coral1", "Males" = "cornflowerblue")) +
  ylab("Change in heavy alcohol use from 2002 to 2012 (%)")+
  ggtitle("Prevalence Change from 2002 to 2012")+
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 9 rows containing missing values (geom_path).

The overall prevalence in binge drinking is higher than heavy drinking by looking at the scale of the y axis. The increase patterns are similar between males and females. In addition to that, I expected to see a large increase around 2009 because of the financial crisis in 2008. But the alcohol use prevalence largely increased from 2010 to 2011. Therefore, I hypothesize that people try to saving money from alcohol when their income is affected.

To verify my hypothesis, I further analyze the drinking prevalence from 2010 and 2011 when an ecomonic factor, median household income, is included. To visualize the comparison between median household income and drinking prevalence, I create maps that whith background showing the median household income and the dots on the maps representing drinking prevalence.

## Your original .Renviron will be backed up and stored in your R HOME directory if needed.
## Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY"). 
## To use now, restart R or run `readRenviron("~/.Renviron")`
## [1] "d38c26d19748e1f9eaaa8bde2d2fe0c6b6b0d606"
## Getting data from the 2006-2010 5-year ACS
## Getting data from the 2007-2011 5-year ACS
## Warning: Column `State` joining character vector and factor, coercing into character vector
## Warning: Column `State` joining character vector and factor, coercing into character vector

The background is the median household income, which increased from 2010 to 2011 while both of them are in the range from 35,000 to 75,000 dollars. The dots on the maps represent the drinking prevalence for each states. The states in the Northern East Coast, the West Coast, Alaska and Hawaii have higher drinking prevalence as well as higher median household income. However, for the states in the North West, the drinking prevalences are high even though the median household income is not as high as the Northern East Coast and the West Coast. In addition, the prevalence patterns are similar to each other across the two years and between heavy and binge drinking degrees.

To quantitatively analyze the correlation between median household income and drinking prevalence, I first made scatter plots to compare the distribution of data points and differences in median houshold incomes, drinking prevalences, and trend lines. Then, I used Spearman’s rank correlation test to test whether correlation exist between the median household income and the drinking prevalence.

From these plots I can tell that the median household incomes are higher in 2011 by looking at the x-axis. The prevalence are also higher while the trend lines are similar within heavy drinking and binge drinking.

## 
##  Spearman's rank correlation rho
## 
## data:  corr.h.2010 and corr.i.2010
## S = 13213, p-value = 0.009049
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##      rho 
## 0.365515
## 
##  Spearman's rank correlation rho
## 
## data:  corr.b.2010 and corr.i.2010
## S = 12906, p-value = 0.006451
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.3802523
## 
##  Spearman's rank correlation rho
## 
## data:  corr.h.2011 and corr.i.2011
## S = 13917, p-value = 0.0186
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.3317397
## 
##  Spearman's rank correlation rho
## 
## data:  corr.b.2011 and corr.i.2011
## S = 11855, p-value = 0.001792
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4307471

The p-values are positive for all four conditions with the rho coefficient ranged from 0.33 to 0.43 (Heavy 2010: p-value = 0.009049, rho = 0.365515; Binge 2010: p-value = 0.006451, rho = 0.3802523; Heavy 2011: p-value = 0.0186, rho = 0.3317397; Binge 2011: p-value = 0.001792, rho = 0.4307471). This means that the median household income has a positive correlation with excessive drinking prevalence: the increase in the household income means an increase in alcohol consumption. However, whether median household income can be a causation of the drinking prevalence needs further investigation and/or experiments.

Conclusion, Limitation, and Future Direction

Through my project I found that the increase in the female alcohol consumption prevalence is higher than in males (2005-2012). From the pespective of geography, most of the increases happen in the West North Central states, North East, Montana, and Kentucky. The increase of male drinking prevalence concentrates in the West North Central states and Kentucky, while Louisiana also stands out. On the other hand, females contributes a lot to the increase of drinking prevalence in Montana for both heavy and binge drinking and in the Great Lakes region for the binge drinking. This project also verified my hypothesis that the median household income has a positive correlation with heavy and binge drinking prevalence. While whether the median household income is a causation of the drinking prevalence remains unknown, states with higher median household income could potentially put more effort into limiting alcohol consumption.

One limitation of my project is that the median household income is not adjusted by local prices. Additionally, the raw data shows the prevalence instead of population of drinking, which could bring potential bias. In this project, I have not looked into specific states such as Montana and Kentucky. One of the future directions could focus on finding out why states like Montana and Kentucky have higher drinking prevalence. Another direction of my project could be including other factors such as unemployment level and social cohesion, as both of these factors had been suggested to affect the drinking prevalence in a previous study (Blomgren, et. al, 2004).